International Journal of Medical Informatics — Latest Matching Preprints

1

Sentiment in Clinical Notes: A Predictor for Length of Stay?

Boyne, A.; Feygin, M.; Sholeen, J.; Zimolzak, A.

2026-03-18 health informatics 10.64898/2026.03.16.26348553 medRxiv

Top 0.1%

16.9%

Background: Length of stay (LOS) is a critical metric for hospital operational efficiency. While structured clinical data is widely used to predict LOS, unstructured admission notes contain latent prognostic information regarding diagnostic uncertainty and disease complexity. This study evaluates the efficacy of extracting sentiment and direct LOS estimates from admission notes to predict patient hospitalization duration. Methods: We conducted a retrospective study of 4,503 adult patients admitted with community-acquired pneumonia between 2013 and 2023. Admission history and physical notes were preprocessed and filtered to extract physician-generated narratives. We evaluated four natural language processing models, VADER, TextBlob, Longformer, and an open-source large language model (GPT-oss-20B), to generate zero-shot sentiment scores. Additionally, GPT-oss-20B was prompted to directly estimate LOS. Model outputs were correlated with actual LOS using linear regression and Pearson correlation coefficients. Results: Sentiment models demonstrated statistically significant, albeit weak, correlations with actual LOS. Longformer achieved the highest variance explained among sentiment classifiers (R² = 0.019). Direct LOS estimation by the LLM outperformed sentiment-based approaches, demonstrating the strongest correlation with actual hospital duration (r = -0.218, p < 0.001). Model agreement was generally poor (ICC = 0.059), and computational time varied drastically, from 2.6 seconds per 100 notes (TextBlob) to over 370 seconds (GPT-oss-20B). Conclusion: Zero-shot sentiment analysis of clinical notes yields a small but measurable correlation with LOS, limited primarily by the objective, non-evaluative nature of clinical documentation. Direct LLM estimation of clinical outcomes outperforms emotional sentiment extraction. Future predictive systems should integrate computationally efficient NLP models capable of capturing latent clinical complexity alongside established structured data variables.

2

Development of a natural language processing application to extract and categorize mentions of violence from mental healthcare records text

Li, L.; Sondh, S.; Sondh, H. K.; Stewart, R.; Roberts, A.

2026-03-26 health informatics 10.64898/2026.03.22.26348435 medRxiv

Top 0.1%

14.1%

Background: Experiences of violence are reported frequently by mental health service users, victims of violence are at a greater risk of mental health disorders, and violence may sometimes occur as a consequence of a mental disorder. Electronic health records (EHRs) are an important source of information about healthcare and its social context. Occurrences of violence are not routinely recorded as structured data in EHRs but are recorded in the free-text narrative. Objective: Our objective was to address this research gap by creating a natural language processing (NLP) application that extracts information related to various forms of violence (physical (non-sexual), sexual, emotional, and financial) from the EHR of a large south London mental health service. Additionally, we aimed to extract features concerning the patient's role (victimization vs. perpetration), timing (recent vs. historic), domestic context, presence (actual, threat, or unclear), and polarity (affirmed, abstract, or negated) of the violent behaviors. Methods: After rigorous training with a pre-defined and approved coding book provided by senior professionals, two raters independently annotated 6,500 randomly selected segments of clinical notes containing violence-related keywords from a large mental healthcare provider in South London. Each segment contained 400 characters (approximately 200 characters before and after the keyword). We utilized 90% of the annotated data for fine-tuning a multi-label BERT model (employing 5-fold cross-validation) with the remaining 10% of data reserved for a blind test. Results: The model performed well on the blind test set for emotional violence (F1 = 0.89), financial violence (0.88), physical (non-sexual) violence (0.84), and unspecified violence (0.81), and the patient role (0.89 as perpetrator; 0.84 as victim), polarity (0.89 for affirmed behavior), presence (0.95 for actual violence), and domestic settings (0.88). We were unable to achieve satisfactory results in capturing temporal aspects (0.65 for past violence). Conclusions: We were able to improve substantially on previously developed NLP for ascertaining violence in routine mental health records, providing novel opportunities for both surveillance and research.

3

Predicting Graduation in Undergraduate Medical Education: A Machine Learning Analysis Across Diverse High School Curricula

Mohamadeya, J.; Khamis, A.; Alsuwaidi, L.; Azar, A.

2026-03-09 medical education 10.64898/2026.03.07.26347831 medRxiv

Top 0.1%

12.8%

Background: The United Arab Emirates (UAE) is characterised by a diverse educational landscape, where students enter medical school from various high school curricula. Understanding how these varied academic backgrounds influence medical students' academic performance is essential. The transition to medical school is a critical phase, with graduation outcomes carrying important implications for both students and institutions. Identifying early predictors of success is crucial to improving student support and academic outcomes in undergraduate medical education. Aim: This study aimed to evaluate the predictive value of high school curriculum type on graduation outcomes in an undergraduate medical education program. Methods: A retrospective cohort study was conducted on undergraduate medical students enrolled at Mohammed Bin Rashid University of Medicine and Health Sciences (MBRU), Dubai Health, Dubai, UAE, from its inception in 2016 through 2024. The data were accessed for this research on 04/06/2024. The study employed machine learning methods, including Bayesian Networks (BN), Neural Networks (NN), and Random Forests (RF), to evaluate the predictive power of high school curriculum type and other academic variables for graduation success. Results: The study included 661 undergraduate medical students, predominantly female, 76.7% (n=507). Students represented 11 high school curricula, with the American (48.1%) and British (22.7%) systems being the most common. Among 122 students eligible to graduate, the Bayesian Network model demonstrated the highest predictive accuracy (AUC = 0.94). The cumulative GPA was the most influential predictor. The model correctly identified 269 out of 494 students (54.5%) as likely to graduate. Conclusion: The type of high school curriculum alone is not a strong predictor of graduation success. Academic performance during medical school and providing targeted support for students from diverse educational backgrounds are more robust predictors. Advanced predictive modelling holds promise for educational research and institutional policy development.

4

Bias in respiratory diagnoses by Large Language Models (LLMs) in Low Middle Income Countries (LMICs)

Mouelhi, A.; Patel, K.; Kussad, S.; Ojha, S.; Prayle, A. P.; LMIC Medical AI Alignment Group,

2026-03-03 health informatics 10.64898/2026.03.02.26347405 medRxiv

Top 0.1%

12.4%

Introduction: Clinicians and patients are likely to increasingly use Large Language Models (LLMs) for diagnostic support. Use of LLMs, mostly created in North America and Europe, could lead to a High-Income Country bias if used in Low- and Middle-Income Country (LMIC) healthcare settings. We aimed to explore whether diagnostic suggestions made by LLMs are relevant in LMIC settings. Methods: Five short respiratory clinical vignettes were produced. For each vignette, a group of doctors from one of 5 countries (Ghana, India, Jordan, Brazil, and the UK) independently gave the 4 most likely diagnoses. Four LLMs (ChatGPT, Claude Sonnet, Google Gemini and Microsoft Copilot) were prompted with the same vignettes, and the top 4 diagnoses for each case were requested. A Virtual Private Network (VPN) was used to access the LLM from each of the 4 countries, and in a second experiment the LLM was given the same vignettes but also informed in the prompt of the country in which the case was based. The diagnoses presented by the LLMs were compared with the doctors' diagnoses for the LMICs and also compared to the UK. Results: A total of 106 unique diagnoses were offered by 21 doctors, and 53 by LLMs with a VPN. The LLMs proposed fewer of the doctors' diagnoses in LMICs than in the UK: 50% (95% CI 32.6 to 67.4%) in the UK compared to 32.0% (95% CI 23.1 to 42.3%) in LMICs. This effect persisted when the LLM was informed of the location of the doctor in the prompt. Overall, LLMs performed worse in the LMIC setting (Chi-squared p = 0.028). Conclusion: Doctors working in LMICs consider a wider range of diagnoses than LLMs, even when LLMs are queried from that country, or informed that they are in that country. LLMs appear to show a bias when considering likely diagnoses, probably related to the epidemiology of high-income countries.

5

Extracting patient reported cannabis use and reasons for use from electronic health records: a benchmarking study of large language models

Wang, Y.; Bozkurt, S.; Le, N.; Alagappan, A.; Huang, C.; Rajwal, S.; Lewis, A.; Kim, J.; Falasinnu, T.

2026-03-09 health informatics 10.64898/2026.03.06.26347824 medRxiv

Top 0.1%

10.2%

Objective: To develop and evaluate a scalable and reproducible natural language processing (NLP) approach using large language models (LLMs) to identify cannabis use status and reasons for cannabis use among patients with autoimmune rheumatic diseases (ARDs) from unstructured electronic health record (EHR) clinical notes. Methods and Analysis: We conducted a retrospective study using EHR clinical notes from patients with ARDs (2015-2024). Notes were screened for cannabis-related mentions using fuzzy string matching against a curated keyword lexicon with a similarity threshold of 90, extracting 50-word context windows (±25 words). Two domain experts annotated 886 randomly sampled snippets across four classes: (1) not a true cannabis mention/uncertain, (2) denial of use, (3) positive past use, and (4) positive current use. Using these annotations, we compared multiple LLM prompting strategies (zero-shot to few-shot; temperature tuning) and a fine-tuned clinical model (GatorTron 345M). For "reason for use," 1,027 snippets were annotated into six categories: pain, nausea, sleep, anxiety/stress/mood, appetite, and not mentioned/unknown. Models were evaluated on a held-out validation set using accuracy, F1, recall, and precision. We then aggregated snippet-level predictions to patient level to describe temporal trends and subgroup differences. Results: For cannabis use status classification, the fine-tuned GatorTron model achieved the highest performance (accuracy 0.90; F1 0.91; recall 0.90; precision 0.90). For reason-for-use classification, gpt-oss-20B achieved the highest performance (accuracy 0.77; F1 0.77; recall 0.77; precision 0.86). Patient-level analyses characterized trends in documented cannabis use from 2015-2024 and compared clinical characteristics between current users and patients denying use. Conclusion: High-precision extraction of cannabis use status and reasons for use from EHR notes is feasible using a combination of fine-tuned clinical language models and LLM-based classifiers. This approach enables scalable measurement of patient-reported symptom self-management strategies in ARDs, supporting observational research and potential clinical decision support.
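
The screening step described in this abstract (fuzzy keyword matching at a similarity threshold of 90, then ±25-word context windows) can be illustrated with a minimal sketch. This is not the authors' pipeline: the lexicon below is hypothetical, and Python's standard-library `difflib.SequenceMatcher` stands in for whatever fuzzy-matching library the study used, so scores may differ from theirs.

```python
from difflib import SequenceMatcher

# Hypothetical lexicon; the study's curated keyword list is not given in the abstract.
LEXICON = ["cannabis", "marijuana", "cbd", "thc"]


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score between two strings."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def extract_snippets(note: str, threshold: float = 90.0, window: int = 25) -> list[str]:
    """Return context windows (+/- `window` words) around fuzzy lexicon matches."""
    words = note.split()
    snippets = []
    for i, word in enumerate(words):
        # Normalise the token before comparing it with each keyword
        token = word.lower().strip(".,;:()\"'")
        if any(similarity(token, kw) >= threshold for kw in LEXICON):
            start = max(0, i - window)
            snippets.append(" ".join(words[start:i + window + 1]))
    return snippets
```

For example, `extract_snippets("Patient reports using marijuanna nightly for sleep.", window=2)` catches the misspelling and returns the five-word window around it; with the default `window=25` the whole sentence would be returned.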

6

Can Machine Learning Algorithms use Contextual Factors to Detect Unwarranted Clinical Variation from Electronic Health Record Encounter Data during the Treatment of Children Diagnosed with Acute Viral Pharyngitis

mcowiti, a. O.; Neaimeh, Y. R.; Gu, J.; Lalani, Y.; Newsome, T. C.; nguyen, Y. H.; Shrager, S.; Rasmy, L. O.; Fenton, S. H.

2026-03-02 health informatics 10.64898/2026.02.23.26346757 medRxiv

Top 0.1%

10.2%

Rationale, Aims and Objectives: Unwarranted clinical variation (UCV) in patient care often arises from contextual factors and contributes to increased costs, unnecessary treatments, and deviations from evidence-based practice. Detecting UCV is challenging due to the complexity of care decisions. Current approaches rely on centralized data aggregation and mixed-effects regression, which estimate relative variation but cannot detect absolute variation. Moreover, machine learning (ML) methods leveraging contextual factors for UCV detection are lacking. The objective is to demonstrate the feasibility of ML for identifying absolute UCV using contextual features extracted from electronic health records (EHR) and identify the factors correlated with UCV in treating acute viral pharyngitis in children. Methods: We conducted a retrospective study of pediatric ambulatory visits (ICD-10 J02.8) at an academic health system. The use case focused on unwarranted antibiotic prescriptions for acute viral pharyngitis. We trained three ensemble ML models, Random Forest, CatBoost, and Explainable Boosting Machine (EBM), using encounter-level EHR data. Performance was evaluated using nested cross-validation and AUC metrics. We also compared CatBoost models trained on curated (gold-standard) versus weak labels. Results: All three ML models demonstrated robust performance, with a median AUC of 0.91, using data from 24 clinics, 81 providers, and 122 patients within an academic health system. CatBoost models trained on weak labels exhibited performance comparable to those trained on gold-standard labels. Feature importance analysis indicated that site-level and provider-level case volumes were the most influential predictors, followed by provider credential, years of experience, and encounter type. Notably, lower provider case volumes were associated with a reduced likelihood of inappropriate treatment. Conclusions: Classical ML models can effectively detect absolute UCV using contextual EHR features. Explainable models such as EBM offer interpretability critical for clinical adoption. These findings support ML-based approaches as scalable alternatives to traditional statistical methods for UCV detection without requiring centralized data analysis.

7

Medical Students' Use of Large Language Models: A National Survey

Barr, A. A.; Rozman, R. C.; Liu, K.; Pham, M.; Klarenbach, Z.; Chinna-Meyyappan, A.; Hassan, A. Y.; Zarychta, M.; El Ferri, O.; Al-Khaz'Aly, A.; Datt, P.; Herik, A. I.; Sadek, K.; Paget, M.; Holodinsky, J. K.

2026-01-29 medical education 10.64898/2026.01.26.26344898 medRxiv

Top 0.1%

10.2%

Background: Large language models (LLMs) are increasingly embedded in medical education and clinical care settings, yet limited empirical data describe how medical students in Canada use and perceive these tools. We aimed to characterize student engagement including LLMs used, frequency, purposes, trust, accuracy, perceived impacts, and attitudes toward educational and clinical integration. Methods: We conducted a national survey of medical students in Canada distributed between November and December 2025. We summarized responses using descriptive statistics and compared results between students in preclerkship versus clerkship using Fisher's exact test. Results: Among 286 respondents from 10 medical schools, 96.50% reported using at least one LLM. The most commonly used LLMs were ChatGPT (93.36%) and OpenEvidence (57.69%). Daily/weekly use was most frequent for coursework assistance (60.22%) and clinical questions (57.14%). Most respondents reported positive impacts on efficiency (81.62%), learning (77.01%), and academic performance (59.49%). Students commonly reported encountering inaccurate information (90.18%). Formal instruction on LLM use was uncommon (10.95%), though 67.67% of students agreed medical schools should integrate formal instruction on LLMs. Only 21.43% of respondents felt adequately educated on data privacy regulations applicable to these tools. Conclusions: LLM use among surveyed medical students in Canada was nearly universal and perceived favourably. However, students reported exposure to inaccurate outputs and substantial gaps in formal training and privacy literacy. These findings support the development of structured curricular guidance on appropriate application of these tools, including information verification practices and ethical, privacy-aware engagement.

8

AI-Driven Feature Selection Using Only Survey Variable Descriptions: Large Language Models Identify Adolescent Vaping Predictors

Zhang, K.; Zhao, Z.; Hu, Y.; Le, T.

2026-03-09 health informatics 10.64898/2026.03.06.26347816 medRxiv

Top 0.1%

10.1%

Objective: To evaluate the effectiveness of various Large Language Models (LLMs) in identifying reliable predictors of Electronic Nicotine Delivery Systems (ENDS) initiation among adolescents, using solely large-scale survey variable descriptions. Methods: A cohort of 7,943 tobacco-naive adolescents aged 12-16 years from the Population Assessment of Tobacco and Health (PATH) Study was analyzed to predict ENDS use at wave 5. Four instruction-tuned LLMs (GPT-4o, LLaMA 3.1-70B, Qwen 2.5-72B-Instruct, and DeepSeek-V3) were systematically evaluated for text-based feature selection using only variable descriptions from wave 4.5. Selected features were used to train LightGBM classifiers, with model performance compared to a baseline. Results: Our findings reveal notable consistency among the four instruction-tuned LLMs, with substantial overlap in the top predictors each model identified. These selected variables spanned critical domains such as peer and household influence, risk perception, and exposure to tobacco-related cues. LightGBM classifiers trained on PATH wave 4.5-5 data using features selected by the LLMs demonstrated strong predictive performance. Notably, Qwen 2.5-72B-Instruct achieved an AUC of 0.791 with 30 predictors, surpassing the baseline AUC of 0.768. Discussion: The substantial overlap among the top predictors identified by different LLMs suggests a shared reasoning process, despite variations in model architecture and training. LightGBM classifiers trained on these LLM-selected features achieved performance comparable to, or exceeding, models trained on the full set of survey variables, underscoring the high quality of features selected solely from textual descriptions. Moreover, these findings are consistent with previous tobacco regulatory research, further validating the effectiveness of LLM-driven feature selection. Conclusion: Instruction-tuned large language models can effectively perform text-based feature selection using survey variable descriptions alone, without accessing raw survey data. This scalable, interpretable, and privacy-preserving framework holds promise for behavioral health research and tobacco use surveillance.

9

Machine Unlearning for GDPR Right-to-Erasure in Antimicrobial Resistance Prediction Models

Saniya, S.; Khan, A. A.

2026-03-10 health informatics 10.64898/2026.03.09.26347960 medRxiv

Top 0.1%

9.7%

Objective: Healthcare machine learning models trained on patient data must comply with the General Data Protection Regulation (GDPR) right-to-erasure requirement, which mandates the removal of individual data contributions from deployed models. Full retraining, the current standard, is computationally expensive. This study evaluates Sharded, Isolated, Sliced and Aggregated (SISA) training as an efficient framework for predicting antimicrobial resistance (AMR). Materials and Methods: SISA training (5 shards) was compared with Full Retraining, Label-Flip Retraining, Influence Reweighting, and Selective Tree Pruning on two datasets: the Antibiotic Resistance Microbiology Dataset (ARMD; n = 1,245,767 EHR records) and the BV-BRC/PATRIC genomic surveillance dataset (n = 400,372). Random Forest classifiers used 500 estimators. Metrics included accuracy, AUC-ROC, membership inference attack (MIA) gap, unlearning time, and cumulative 12-month deletion cost. Results: SISA achieved an 8.9x speedup over full retraining on ARMD (7.5 s vs. 66.7 s) and a 9.8x speedup on PATRIC (1.4 s vs. 13.4 s), with accuracy costs of 0.024% and 0.048%, respectively, both below the 0.5% clinical threshold. Label-Flip Retraining and Influence Reweighting provided no speedup (≤ 1.0x), while Tree Pruning exceeded the threshold on EHR data (+0.648%). Over 12 months at 50 monthly deletions, SISA reduced cumulative overhead from 800 s to 90 s (ARMD) and from 160 s to 16 s (PATRIC). Discussion: SISA maintains predictive performance while reducing computational cost, supporting machine unlearning for regulatory compliance in clinical ML systems. Conclusion: SISA provides an efficient framework for maintaining GDPR-compliant AMR prediction models and supports the scalable processing of patients' deletion requests.
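
The idea behind the sharded approach in this abstract, retraining only the shard that held a deleted record rather than the whole model, can be sketched as follows. This is a toy illustration under stated assumptions, not the study's implementation: each shard "model" is just a majority-label counter standing in for a Random Forest, the class and method names are hypothetical, records are assigned to shards by integer ID, and the slicing part of SISA is omitted.

```python
from collections import Counter


class ShardedEnsemble:
    """Toy SISA-style ensemble: one majority-label 'model' per shard."""

    def __init__(self, records, n_shards=5):
        # records: iterable of (integer record_id, label); each record lives in one shard
        self.shards = [dict() for _ in range(n_shards)]
        for rid, label in records:
            self.shards[rid % n_shards][rid] = label
        self.models = [self._fit(shard) for shard in self.shards]
        self.retrain_count = 0

    @staticmethod
    def _fit(shard):
        # Stand-in for training a real classifier on the shard's data
        return Counter(shard.values()).most_common(1)[0][0] if shard else None

    def forget(self, rid):
        """Erase one record and retrain only the shard that contained it."""
        idx = rid % len(self.shards)
        if rid in self.shards[idx]:
            del self.shards[idx][rid]
            self.models[idx] = self._fit(self.shards[idx])
            self.retrain_count += 1

    def predict(self):
        """Aggregate the shard models by majority vote."""
        votes = Counter(m for m in self.models if m is not None)
        return votes.most_common(1)[0][0]
```

With 5 shards, a single deletion touches roughly one fifth of the training data, which is the intuition behind the reported speedups; the actual figures depend on the model and dataset.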

10

Identify Patients at Risk of HIV Using a Clinical Large Language Model from Electronic Health Records

Liu, Y.; Chen, Z.; Suman, P.; Cho, H.; Prosperi, M.; Wu, Y.

2026-04-23 hiv aids 10.64898/2026.04.21.26351427 medRxiv

Top 0.1%

9.1%

This study developed a large language model (LLM)-based solution to identify people at HIV risk using electronic health records. We transformed structured EHR data, including demographics, diagnoses, and medications, into narrative descriptions ordered by visit date and applied GatorTron, a widely used clinical LLM trained on 82 billion words of de-identified clinical text. We compared GatorTron with traditional machine learning models, including LASSO and XGBoost. We identified a cohort with 54,265 individuals, where only 3,342 (6%) had new HIV diagnoses. Our LLM solution, based on GatorTron, achieved excellent performance, reaching an F1 score of 53.5% and an AUC of 0.88, comparable to traditional machine learning approaches. Subgroup analysis showed that, across age, sex, and race/ethnicity groups, both LLM and traditional models achieved AUCs above 0.82. Interpretability analyses showed broadly consistent patterns across LLM models and traditional machine learning models.

11

Predictors of COVID-19 hospital outcomes: a machine learning analysis of the National COVID Cohort Collaborative

Vazquez, J.; Taylor, L.; Chen, Y.-Y. K.; Araya, K.; Farnsworth, M. G.; Xue, X.; Hasan, M.; N3C Consortium,

2026-03-09 health informatics 10.64898/2026.03.06.26347822 medRxiv

Top 0.1%

9.1%

Predicting hospital outcomes for patients with severe acute respiratory infections is critical for risk stratification and resource planning, yet heterogeneous electronic health record (EHR) data, class imbalance, and evolving clinical practice present persistent methodological challenges for machine learning (ML) approaches. We conducted a retrospective cohort study using EHR data harmonized to the OMOP common data model from the National COVID Cohort Collaborative (N3C; May 2020-June 2025), including 263,619 adults hospitalized with COVID-19 across 51 contributing sites. We developed penalized linear regression (elastic net), random forest, XGBoost, and multilayer perceptron (MLP) models to predict hospital length of stay (LOS) and mortality (in-hospital and 60-day), using demographics, comorbidities, prior healthcare utilization, COVID-19 vaccination status, and hospital site as predictors. Missing data were handled via multiple imputation by chained equations (MICE) and class imbalance was addressed using SMOTE. Model performance was evaluated using area under the ROC curve (AUROC), Brier score, calibration plots, and decision curve analysis, following the TRIPOD reporting framework. Mortality prediction achieved moderate discrimination across all models (test AUROC = 0.71-0.73 for in-hospital mortality; 0.72-0.73 for 60-day all-cause mortality). Models trained without SMOTE achieved the highest AUROCs but assigned virtually no patients to the mortality class at the default 0.5 threshold. SMOTE improved recall and F1 score at the cost of reduced AUROC and precision. LOS was poorly explained by available structured predictors (best R² = 0.059). Remdesivir-treated patients (n = 103,536; 39.3%) were older, had higher comorbidity burden, and had higher unadjusted mortality than untreated patients. Common structured EHR features offer moderate utility for mortality risk stratification in hospitalized COVID-19 patients but are insufficient for LOS prediction. The consistent SMOTE-related tradeoff between discrimination and calibration underscores the need to report threshold-dependent metrics alongside AUROC in clinical ML studies, with implications for operational planning during future respiratory disease emergencies.

12

Development and Temporal Evaluation of Multimodal Machine Learning Models to Predict High Inpatient Opioid Exposure

Kale, S.; Singh, D.; Truumees, E.; Geck, M.; Stokes, J.

2026-04-02 health informatics 10.64898/2026.03.31.26349842 medRxiv

Top 0.1%

9.1%

High inpatient opioid exposure is associated with increased risk of persistent opioid use. Early identification of high-risk patients may improve opioid stewardship. We developed machine learning models to predict high opioid exposure during hospitalization using electronic health record data from MIMIC-IV. We conducted a retrospective study of 223,452 unique first hospital admissions in MIMIC-IV. The outcome was high opioid exposure, defined as the top decile among opioid-exposed admissions (MME/day ≥ 225), representing 2.65% of all admissions. Structured early-admission features included demographics, admission characteristics, laboratory utilization and abnormality summaries, and 24-hour procedural indicators. Discharge-note data were incorporated using ClinicalBERT embeddings and interpretable bigram features. Models were trained using an 80/10/10 split and evaluated with temporal validation on the most recent 10% of admissions. Performance was assessed using ROC-AUC and PR-AUC with 95% confidence intervals. Among structured-only models, XGBoost achieved the best test performance (ROC-AUC 0.932 [0.924-0.940]; PR-AUC 0.223 [0.193-0.262]). The combined structured and notes model improved precision-recall performance (ROC-AUC 0.932 [0.920-0.943]; PR-AUC 0.276 [0.229-0.331]). Temporal evaluation showed similar discrimination (ROC-AUC 0.929; PR-AUC 0.223). High-risk bigrams included procedural terms such as "external fixation" and "cervical discectomy." Integration of structured and text-derived features improved risk stratification compared to structured data alone. Interpretable bigram signals reflected procedural complexity and orthopedic pathology, reinforcing the clinical plausibility of model predictions. Multimodal EHR-based models accurately predict high inpatient opioid exposure and may support targeted opioid stewardship during hospitalization.

13

A Rule-Based Machine Learning Model for Predicting Virological Failure Among Children Living With HIV in Malawi

Chiphe, C.

2026-03-10 hiv aids 10.64898/2026.03.09.26347945 medRxiv

Top 0.1%

8.4%

Malawi's HIV treatment monitoring system faces serious challenges because of a shortage of experts and reliance on viral load testing every 3 to 12 months. The process causes dangerous delays in identifying treatment failure, which leads to a higher risk of disease progression, transmission, and death. To tackle this issue, this study combined an association-rule-based machine learning model with clustering analysis to build a framework that identifies key factors and risk profiles for virological failure among children living with HIV (CLHIV) in Malawi. The methodology combines a Random Forest classifier for feature importance, association rule mining to find predictive rules, and k-Prototype clustering for risk profiling among CLHIV. The random forest feature importance results show that Body Mass Index (BMI), CD4 count, TB status, ART regimen, gender, ART adherence, and treatment duration are major drivers of virological failure. In addition to these individual factors, the analysis produced highly reliable association rules with over 90% confidence. This establishes a framework for identifying complex risk profiles and informing focused clinical interventions. The high lift values of 4.9 across the most significant rules demonstrate the model's effectiveness by revealing strong, non-random associations. Clustering analysis also identified two distinct risk profiles associated with virological failure. The k-prototype clustering model performed optimally with a cluster purity of 100% and a silhouette score of 79%.
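
The confidence and lift figures quoted in this abstract follow the standard association-rule definitions, which a short sketch can make concrete. The function name and the item labels in the usage note are hypothetical; the study's actual variables and data are not reproduced here.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    transactions: list of sets of items; antecedent and consequent: sets of items.
    Assumes at least one transaction contains the antecedent and one the consequent.
    """
    n = len(transactions)
    n_antecedent = sum(antecedent <= t for t in transactions)
    n_consequent = sum(consequent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)

    support = n_both / n                      # P(A and C)
    confidence = n_both / n_antecedent        # P(C | A)
    lift = confidence / (n_consequent / n)    # P(C | A) / P(C)
    return support, confidence, lift
```

A lift well above 1 means the consequent occurs far more often when the antecedent is present than it does overall. For instance, if 2 of 10 records contain both `{"low_cd4"}` and `{"failure"}` and no other record contains either, the rule has confidence 1.0 and lift 5.0.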

14

Machine Intelligence-Driven Forecasting for ED Triage and Dynamic Hospital Patient Routing

Dharmavaram, S.; Bhanushali, P.

2026-02-20 emergency medicine 10.64898/2026.02.18.26346566 medRxiv

Top 0.1%

8.2%

Emergency department (ED) overcrowding is now a global healthcare concern, driven by rising patient volumes. Triage systems are long established, but their reliability in assigning patients to the appropriate level of service has come under scrutiny. In this paper, we describe a comprehensive machine learning framework aimed at predicting critical emergency department outcomes and enabling dynamic routing decisions. Using the MIMIC-IV-ED database, which comprises more than 440,000 emergency visits, we design and assess varied predictive models, including classical clinical scores, interpretable ML systems, classical algorithms, and deep learning architectures. We investigate three outcomes: hospitalization after the ED visit, critical deterioration (ICU transfer or death within 12 hours), and 72-hour ED re-attendance. The results indicate that gradient boosting algorithms, with AUROCs of 0.820, 0.881, and 0.699, predict better than standard clinical scoring systems and complex deep learning models. The interpretable AutoScore framework combines clinical performance with clinical transparency. We also study patterns of feature importance across prediction tasks and discuss how these models can be implemented in real-time clinical workflows. This study builds a reproducible benchmarking platform for ED prediction research. In addition, it presents evidence-based recommendations for intelligent patient routing systems that can help enhance emergency care efficiency and resource utilization while improving patient outcomes in a high-pressure environment.

15

Comparative Evaluation of Logistic Regression and Gradient Boosting Models for Influenza Outbreak Early-Warning Using U.S. CDC ILINet Surveillance Data (2010-2025)

Onwuameze, C. N.; Madu, V.

2026-03-13 health informatics 10.64898/2026.03.05.26347655 medRxiv

Top 0.1%

7.3%

Background: Timely detection of seasonal influenza outbreaks is critical for healthcare system preparedness and public health response. Although numerous studies have examined short-term influenza forecasting, fewer have operationalized prediction as a binary early-warning problem linked to actionable surveillance thresholds. This study evaluated the performance of traditional and machine learning models for detecting national influenza outbreak weeks using U.S. Centers for Disease Control and Prevention (CDC) ILINet surveillance data. Methods: Weekly national ILINet data from 2010-2025 were analyzed. Outbreak weeks were defined as those in which weighted influenza-like illness (ILIPERCENT) exceeded the 90th percentile of the 2010-2017 training distribution (threshold = 3.3932%). Predictors included three-week lags of ILIPERCENT and percent positive laboratory specimens, along with seasonal harmonic terms. Models were trained on 2010-2017 data and evaluated on a temporally held-out 2020-2025 test period. Performance metrics included area under the receiver operating characteristic curve (AUC), precision-recall area under the curve (PR-AUC), sensitivity, specificity, precision, and F1-score. Findings: On the 2020-2025 test set, logistic regression achieved an AUC of 0.9964 and PR-AUC of 0.9868, with sensitivity of 1.0000 and specificity of 0.9516. XGBoost achieved an AUC of 0.9946 and PR-AUC of 0.9812, with sensitivity of 0.8939 and specificity of 0.9798. Both models demonstrated near-perfect discrimination between outbreak and non-outbreak weeks under strict temporal validation. Interpretation: National influenza outbreak early-warning can be implemented using publicly available CDC surveillance data with high discriminatory accuracy. Framing prediction as a threshold-based outbreak detection problem strengthens operational relevance and supports integration of predictive analytics into routine influenza surveillance and preparedness planning.
Author Summary: Seasonal influenza places a heavy burden on hospitals and communities each year, yet public health officials often rely on surveillance reports that describe what has already happened rather than signaling when activity is about to intensify. We examined whether routinely collected U.S. influenza surveillance data could be used to detect outbreak conditions earlier and more clearly. Using national data from the Centers for Disease Control and Prevention (CDC) covering 2010 to 2025, we compared a traditional statistical model with a machine learning approach to determine how accurately each could identify weeks when influenza activity exceeded a predefined outbreak threshold. Both approaches performed extremely well when tested on recent seasons, correctly distinguishing outbreak from non-outbreak weeks with high accuracy. Importantly, this framework translates weekly surveillance data into a practical alert signal rather than simply producing numerical forecasts. By linking model output to a clear outbreak definition, health departments and healthcare systems could use similar tools to support timely planning, communication, and resource allocation during influenza season.
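
The threshold-based labeling and feature construction described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy series, the linear-interpolation percentile, and the 52-week harmonic period are all assumptions, and only the ILI series is lagged here (the abstract also lags percent positive specimens).

```python
import math

def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list.
    The interpolation rule is an assumption; the abstract does not state one."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def build_rows(ili, week_of_year, threshold, n_lags=3, period=52.0):
    """Return (features, label) pairs for each week that has n_lags predecessors.
    Features: n_lags lagged ILI values plus one sine/cosine harmonic pair.
    Label: 1 if that week's ILI exceeds the outbreak threshold, else 0."""
    rows = []
    for t in range(n_lags, len(ili)):
        lags = [ili[t - k] for k in range(1, n_lags + 1)]
        angle = 2.0 * math.pi * week_of_year[t] / period
        features = lags + [math.sin(angle), math.cos(angle)]
        label = 1 if ili[t] > threshold else 0
        rows.append((features, label))
    return rows

# Hypothetical toy series standing in for the training-period ILIPERCENT values.
train_ili = [1.0, 1.2, 1.5, 2.0, 2.8, 3.6, 4.1, 3.0, 2.1, 1.4, 1.1]
train_weeks = list(range(1, len(train_ili) + 1))

# The threshold is fixed from the training distribution only, then reused on
# later data so that the test period cannot leak into the outbreak definition.
threshold = percentile(train_ili, 90)
rows = build_rows(train_ili, train_weeks, threshold)
```

The resulting rows could be passed to any binary classifier; the abstract reports logistic regression and XGBoost.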

16

Large language models and retrieval augmented generation for complex clinical codelists: evaluating performance and assessing failure modes

Matthewman, J.; Denaxas, S.; Langan, S.; Painter, J. L.; Bate, A.

2026-04-24 health informatics 10.64898/2026.04.23.26351098 medRxiv

Top 0.1%

7.0%

Objectives: Large language models (LLMs) have shown promise in creating clinical codelists for research purposes, a time-consuming task requiring expert domain knowledge. Here, we evaluate the performance and assess failure modes of a retrieval augmented generation (RAG) approach to creating clinical codelists for the large and complex medical terminology used by the Clinical Practice Research Datalink (CPRD). Materials & Methods: We set up a RAG system using a database of word embeddings of the medical terminology that we created using a general-purpose word embedding model (gemini-embedding). We developed 7 reference codelists presenting different challenges and tagged required and optional codes. We ran 168 evaluations (7 codelists, 2 different database subsets, 4 models, 3 epochs each). Scoring was based on the omission of required codes and the inclusion of irrelevant codes. We used model-grading (i.e., grading by another LLM with the reference codelists provided as context) to evaluate the output codelists (a score of 0% being all incorrect and 100% being all correct). Results: Accuracy varied across models and codelists: Gemini 3 Pro (score 43%) generally performed better than Claude Sonnet 4.6 (36%) and Gemini 3 Flash, with OpenAI GPT 5.2 performing worst (14%). Models performed better with shorter target codelists (e.g., Eosinophilic esophagitis with four codes, and Hidradenitis suppurativa with 14 codes). By contrast, all models consistently failed to produce a complete Wrist fracture codelist (with 214 required codes). We further present evaluation summaries, and failure mode evaluations produced by parsing LLM chat logs. Discussion: Besides demonstrating that a single-shot RAG approach is currently not suitable for codelist generation, we demonstrate failure modes including hallucinations, retrieval failures, and generation failures where retrieved codes are not used.
Conclusions: Our findings suggest that while RAG systems using current frontier LLMs may create correct clinical codelists in some cases, they still struggle with large and complex terminologies and codelists with a large number of codes. The failure modes we highlight can inform the creation of future workflows to avoid failures.
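
The two scoring criteria named in this abstract (omitted required codes and included irrelevant codes) can be illustrated with a deterministic set comparison. This is a swapped-in simplification, not the study's method: the authors report grading by another LLM, and the function name, return fields, and code identifiers below are hypothetical.

```python
def score_codelist(generated, required, optional=frozenset()):
    """Compare a generated codelist against a reference codelist.
    Optional codes are neither rewarded nor penalized (an assumption)."""
    generated, required, optional = set(generated), set(required), set(optional)
    missing = required - generated               # required codes that were omitted
    irrelevant = generated - required - optional # codes outside the reference
    recall = 1.0 if not required else 1.0 - len(missing) / len(required)
    return {
        "required_recall": recall,
        "missing_required": sorted(missing),
        "irrelevant": sorted(irrelevant),
    }

# Placeholder code identifiers, not real terminology codes.
result = score_codelist(
    generated={"A1", "A2", "B1", "Z9"},
    required={"A1", "A2", "A3", "A4"},
    optional={"B1"},
)
```

A check of this kind makes the dependence on codelist length explicit: a list with 214 required codes offers many more chances for omission than one with four.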

17

Augmenting Electronic Health Records for Adverse Event Detection

Kaynar, G.; You, Z.; Boyce, R. D.; Yakoh, T.; Kingsford, C.

2026-02-11 health informatics 10.64898/2026.02.10.26345962 medRxiv

Top 0.1%

6.9%

Objective: Adverse events (AEs) resulting from medical interventions are significant contributors to patient morbidity, mortality, and healthcare costs. Prediction of these events using electronic health records (EHRs) can facilitate timely clinical interventions. However, effective prediction remains challenging due to severe class imbalance, missing labels, and the complexity of EHR records. Classical machine learning approaches frequently underperform due to insufficient representation of minority adverse event classes and limited capacity to capture interactions among patient demographics, administered medications, and associated complications. Methods: We introduce TASER-AE, a novel data augmentation pipeline tailored for structured EHR data, coupled with transformer-based classification. TASER-AE addresses these issues through an NLP-inspired data augmentation framework adapted for EHR, enabling effective minority-class representation in sparse and imbalanced clinical datasets. The augmented records produced by TASER-AE alleviate class imbalance by enriching the representation of minority adverse event classes, which enhances the robustness and predictive performance of the classifier. Results: TASER-AE yields minority-class F1 scores up to 0.70, substantially surpassing classical machine-learning baselines and prior augmentation methods across multiple adverse event tasks. Experiments conducted on two distinct EHR datasets confirm TASER-AE's ability to substantially improve adverse event detection performance. Conclusion: These results demonstrate the potential of structured, NLP-inspired augmentation methods to overcome data limitations in clinical predictive modeling, ultimately contributing to improved patient safety outcomes. TASER-AE is available at https://github.com/Kingsford-Group/taserae.

18

Data-Driven Hybrid Model of SARIMA-CNNAR For Tuberculosis Incidence Time Series Analysis in Nepal

Singh, D. B.; Dawadi, P. R.; Dangi, Y.

2026-02-24 health informatics 10.64898/2026.02.22.26346853 medRxiv

Top 0.1%

6.9%

Background: Tuberculosis (TB) remains a major public health challenge in Nepal, with incidence rates substantially higher than global estimates. Accurate forecasting of TB incidence is essential for early warning systems, resource allocation, and targeted interventions. This study aimed to develop and validate a hybrid Seasonal Autoregressive Integrated Moving Average (SARIMA) and Convolutional Neural Network Auto-Regressive (CNNAR) model for TB incidence forecasting in Nepal. Methods: Monthly TB incidence data (January 2015 to December 2024) were obtained from the National Tuberculosis Control Center (NTCC), Nepal. A hybrid SARIMA-CNNAR model was developed, where SARIMA modeled linear seasonal trends and CNNAR captured nonlinear patterns in the residuals. Hyperparameters were optimized using grid search with 5-fold cross-validation. Model performance was evaluated using Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), and R² on the 2024 test set. Structural break analysis and sensitivity analysis assessed model robustness. The hybrid model was compared against standalone SARIMA, CNNAR, and three state-of-the-art benchmarks: Long Short-Term Memory (LSTM), Facebook Prophet, and XGBoost. Results: TB incidence in Nepal increased from a monthly average of 2,048 cases in 2015 to 3,447 in 2024 (68.4% increase). The hybrid SARIMA-CNNAR model demonstrated strong performance with test set metrics of MAE=248.35, RMSE=294.31, MAPE=7.2%, and R²=0.79. Comparative performance: CNNAR (MAE=251.08, RMSE=336.55, MAPE=7.7%, R²=0.73); LSTM (MAE=267.91, RMSE=324.55, MAPE=7.5%, R²=0.75); XGBoost (MAE=314.74, RMSE=373.99, MAPE=8.5%, R²=0.66); Prophet (MAE=371.15, RMSE=478.40, MAPE=10.4%, R²=0.45); SARIMA (MAE=401.11, RMSE=503.93, MAPE=10.99%, R²=0.39). All models captured seasonal peaks in March-May and July-August, with forecasts for 2025 indicating continued seasonal patterns.
Sensitivity analysis confirmed robustness with <5% metric variation across parameter configurations. Conclusions: This first validated hybrid model for TB prediction in Nepal demonstrates high forecasting accuracy by integrating linear seasonal modeling with nonlinear pattern detection. The approach offers a robust tool for evidence-based public health planning in resource-limited settings and is suitable for integration into national surveillance systems. Author Summary: Tuberculosis remains a major public health challenge in Nepal, with cases increasing substantially over the past decade. In this study, we developed a computer model that combines two different forecasting approaches, one that captures regular seasonal patterns and another that learns complex trends from data, to predict monthly TB cases. Using ten years of national surveillance data, our hybrid model achieved high accuracy in forecasting TB incidence, outperforming standard approaches including SARIMA, Prophet, CNNAR, LSTM neural networks, and XGBoost. The model successfully predicted seasonal peaks in March-May and July-August, with forecasts for 2025 suggesting continued high case numbers. These predictions can help Nepal's health authorities prepare by pre-positioning diagnostic supplies, scheduling additional staff during peak months, and targeting awareness campaigns. The modeling approach is designed to be adaptable for other diseases and countries with similar health data.
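
The four evaluation metrics named in this abstract, and the idea of adding a residual forecast to a linear seasonal forecast, can be sketched as follows. The additive combination is an assumption about how the two components are merged, and the function names and toy numbers are hypothetical; none of this reproduces the authors' fitted models.

```python
import math

def hybrid_forecast(linear_part, residual_part):
    """Assumed additive hybrid: linear seasonal forecast plus a forecast of
    the residuals left over by the linear model."""
    return [l + r for l, r in zip(linear_part, residual_part)]

def forecast_metrics(actual, predicted):
    """Return MAE, RMSE, MAPE (in percent), and R2 for two equal-length series.
    Assumes no actual value is zero and that the actual values are not all equal."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": 1.0 - ss_res / ss_tot}

# Hypothetical monthly case counts and component forecasts.
actual = [3000, 3400, 3600, 3200]
linear_part = [2900, 3300, 3500, 3300]   # stand-in for a SARIMA forecast
residual_part = [50, 150, 50, -50]       # stand-in for a CNNAR residual forecast

predicted = hybrid_forecast(linear_part, residual_part)
metrics = forecast_metrics(actual, predicted)
```

Because MAE and RMSE are in case counts while MAPE and R² are scale-free, reporting all four, as the abstract does, lets readers compare models both in absolute and relative terms.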

19

Development of a Deep Learning Based Framework for Classification of Indian Venomous Snakes Integrated with Explainable Artificial Intelligence for primary and emergency care providers

Manna, I. I. A.; Wagle, U.; Balaji, B.; Lath, V.; Sampathila, N.; Sirur, F. M.; Upadya, S.

2026-03-18 emergency medicine 10.64898/2026.03.16.26348471 medRxiv

Top 0.1%

6.6%

Background: Snakebite envenoming is a significant global health crisis that has long been neglected as a global health priority. It is a major problem for rural communities in low- and middle-income countries, and India accounts for the largest proportion of snakebite deaths globally. Timely identification of venomous snakebite and its syndromic pattern is essential for effective administration of antivenom and supportive treatment. Expert identification of snake species and syndromes is not always available in peripheral healthcare settings. This leads to delays, unnecessary referrals, or improper treatment choices. Additionally, diverse snake species distribution and venom variations across regions pose challenges. AI-powered image classification methods can help overcome these barriers. We propose a clinically oriented deep learning pipeline for binary classification of venomous and non-venomous snake species of India using real-world imagery data. This pipeline would serve as a baseline step towards aiding snakebite management at peripheral healthcare setups with scarce resources. Methods: The selected dataset consisted of 20 medically important Indian species. MobileViT-S, ConvNeXt-Tiny, EfficientNet-V2-S and ResNeXt-50 (32x4d) were trained under the same conditions for comparison. Model interpretability was evaluated using Grad-CAM++ to ensure that classification was based not on the background but on features such as head shape and body stripes. For reliable deployment, we connected the model to a web interface with human-in-the-loop expert verification. Experts can confirm or override predictions in real time. Results: Among the evaluated architectures, ResNeXt-50 (32x4d) showed the most reliable and consistent performance in classifying venomous and non-venomous snakes. It achieved the highest test accuracy, sensitivity, specificity, and F1-score. The model also had strong discriminative ability, with a ROC-AUC of 0.9950 and PR-AUC of 0.9959.
These results indicate dependable performance in safety-critical screening situations. Grad-CAM++ visualizations confirmed that predictions were based on anatomically relevant features, especially in the head and body contour areas. This supports model interpretability and reduces background bias. Conclusions: Although the dataset size and single-institution source limit how widely the results can be applied, the proposed framework shows that it is possible to create a clinically oriented, ready-to-use deep learning system for snakebite triage support. This system is intended as a scalable tool to help rural healthcare workers, emergency responders, and telemedicine platforms in areas where snakebites are common. Author Summary: Snakebite is a major public health concern that disproportionately affects the rural population. Delays in identifying whether a snake is venomous often lead to delayed treatment, unnecessary use of antivenom, or inappropriate referrals. In many rural settings, access to expert snake identification is limited. To address this gap, we developed an artificial intelligence (AI)-based image classification system that sorts snakes into two clinically relevant categories: venomous or non-venomous. Unlike many previous studies that focused on ideal, high-quality wildlife images, our model was trained using real-world photographs captured in emergency situations, including images taken by patients and field responders under variable lighting and background conditions. This approach improves the model's relevance to practical healthcare settings. The system achieved high accuracy and was further strengthened by visual interpretability tools and expert verification to ensure reliability. By combining AI-assisted classification with human oversight, this work provides a scalable decision-support tool that may improve early triage, rational antivenom use, and surveillance in snakebite-endemic regions.

20

Leveraging large language models to address common vaccination myths and misconceptions

Reis, F.; Bayer, L. J.; Malerczyk, C.; Lenz, C.; von Eiff, C.

2026-03-02 health informatics 10.64898/2026.02.27.26347254 medRxiv

Top 0.2%

6.4%

Large language models (LLMs) are increasingly used by the public to seek health information, yet their reliability in addressing common vaccine myths remains unclear. We conducted an exploratory multi-vendor evaluation of three LLMs (GPT-5, Gemini 2.5 Flash, Claude Sonnet 4) using officially curated vaccination myths from Germany's public health institution and two realistic user framings as prompts: a curious skeptic and a convinced believer. All model responses were independently evaluated by two blinded medical experts for whether the misconception was addressed (binary), scientific accuracy, and communication clarity (5-point Likert scales). Additionally, blinded marketing experts ranked models for lay communication clarity, and Flesch-Kincaid Reading Ease scores were computed for all outputs. Across all myths, prompts, and models (11 x 2 x 3 = 66 rating items), medical raters found 100% successful refutation of misinformation. Scientific accuracy and clarity ratings were high and tightly clustered (median 4.0-4.5), with no combined score below 3 and substantial inter-rater agreement. Marketing experts independently ranked Gemini 2.5 Flash and GPT-5 highest for lay clarity, with Claude Sonnet 4 consistently less favored. Readability analysis revealed generally low accessibility, particularly for the convinced believer framing and for Claude Sonnet 4 outputs. Our findings suggest that current general-purpose LLMs can deliver accurate debunking of widely documented vaccine myths under realistic conditions, but that linguistic complexity and framing-sensitive style may limit accessibility. Careful integration of LLMs into public health channels, alongside transparent sourcing and readability optimization, could enable these models to be used as scalable tools for debunking vaccine myths.
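
The readability scoring mentioned in this abstract can be sketched under the assumption that it refers to the standard Flesch Reading Ease formula for English text, 206.835 - 1.015 x (words per sentence) - 84.6 x (syllables per word). The abstract does not state the language of the model outputs or the tool used; the vowel-group syllable counter below is a rough heuristic chosen for illustration, and the function names are hypothetical.

```python
import re

def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per vowel group, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    """Standard Flesch Reading Ease; higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

easy = flesch_reading_ease("The cat sat. The dog ran.")
hard = flesch_reading_ease(
    "Immunological misconceptions necessitate comprehensive epidemiological clarification."
)
```

Long sentences and polysyllabic vocabulary both lower the score, which is consistent with the abstract's observation that accurate but linguistically complex answers may be less accessible.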